The bible, truth, and multilingual OCR evaluation
Internal identifier: 002027 (Main/Exploration); previous: 002026; next: 002028
Authors: T. Kanungo [United States, Japan]; P. Resnik [Japan, United States]
Source:
- SPIE proceedings series [ 1017-2653 ] ; 1999.
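French descriptors
- Pascal (Inist): Traitement image document, Reconnaissance optique caractère, Reconnaissance forme, Recherche documentaire, Analyse documentaire, Multilinguisme.
- Wicri:
- topic: Recherche documentaire, Multilinguisme.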
English descriptors
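- KwdEn: Document analysis, Document image processing, Document retrieval, Multilingualism, Optical character recognition, Pattern recognition.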
Abstract
Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a document image dataset in each of these languages, and it is difficult to find document datasets in different languages that are similar in content, layout, and fonts. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
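The abstract does not say which accuracy measure the authors use. As an illustration only, the sketch below assumes a common choice: character accuracy derived from the Levenshtein edit distance between a groundtruth line and the OCR output for the same line. The function names and the two sample line pairs are hypothetical and are not taken from the paper or its datasets.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and
    substitutions needed to turn `truth` into `ocr`."""
    prev = list(range(len(ocr) + 1))  # distances from the empty prefix of truth
    for i, t_char in enumerate(truth, start=1):
        curr = [i]  # turning truth[:i] into the empty string costs i deletions
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            curr.append(min(prev[j] + 1,          # delete t_char
                            curr[j - 1] + 1,      # insert o_char
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """(n - errors) / n, where n is the groundtruth length.
    Can be negative when the OCR output is much longer than the groundtruth."""
    if not truth:
        raise ValueError("groundtruth must be non-empty")
    return (len(truth) - edit_distance(truth, ocr)) / len(truth)


# Hypothetical (groundtruth, OCR output) pairs for the same verse opening.
samples = {
    "en": ("In the beginning", "In the beginnlng"),
    "fr": ("Au commencement", "Au comrnencement"),
}

for lang, (truth, ocr) in samples.items():
    print(lang, round(character_accuracy(truth, ocr), 4))
```

Because each language's score is normalized by the length of its own groundtruth, scores computed on parallel text can be set side by side, which is the kind of cross-language comparison the abstract argues for.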
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000821
- to stream PascalFrancis, to step Curation: 000B73
- to stream PascalFrancis, to step Checkpoint: 000755
- to stream Main, to step Merge: 002136
- to stream Main, to step Curation: 002027
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">The bible, truth, and multilingual OCR evaluation</title>
<author><name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Center for Automation Research, University of Maryland</s1>
<s2>College Park, MD 20742 </s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<placeName><settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
<author><name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<placeName><settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4"><inist:fA14 i1="03"><s1>Department of Linguistics, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">99-0297905</idno>
<date when="1999">1999</date>
<idno type="stanalyst">PASCAL 99-0297905 INIST</idno>
<idno type="RBID">Pascal:99-0297905</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000821</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B73</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000755</idno>
<idno type="wicri:doubleKey">1017-2653:1999:Kanungo T:the:bible:truth</idno>
<idno type="wicri:Area/Main/Merge">002136</idno>
<idno type="wicri:Area/Main/Curation">002027</idno>
<idno type="wicri:Area/Main/Exploration">002027</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">The bible, truth, and multilingual OCR evaluation</title>
<author><name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
<affiliation wicri:level="4"><inist:fA14 i1="01"><s1>Center for Automation Research, University of Maryland</s1>
<s2>College Park, MD 20742 </s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<placeName><settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
<author><name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
<affiliation wicri:level="4"><inist:fA14 i1="02"><s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>JPN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<placeName><settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4"><inist:fA14 i1="03"><s1>Department of Linguistics, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint><date when="1999">1999</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Document analysis</term>
<term>Document image processing</term>
<term>Document retrieval</term>
<term>Multilingualism</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Traitement image document</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Analyse documentaire</term>
<term>Multilinguisme</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Recherche documentaire</term>
<term>Multilinguisme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a document image dataset in each of these languages, and it is difficult to find document datasets in different languages that are similar in content, layout, and fonts. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
<li>États-Unis</li>
</country>
<region><li>Maryland</li>
</region>
<settlement><li>College Park (Maryland)</li>
</settlement>
<orgName><li>Université du Maryland</li>
</orgName>
</list>
<tree><country name="États-Unis"><region name="Maryland"><name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
</region>
<name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
</country>
<country name="Japon"><region name="Maryland"><name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
</region>
<name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002027 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002027 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:99-0297905 |texte= The bible, truth, and multilingual OCR evaluation }}
This area was generated with Dilib version V0.6.32.